It is easy to see the correspondence between the AGV scheduling and reinforcement learning. Assume 
that the AGV gets several requests to transport objects from one place to another. The AGV might choose 
to serve these requests in some order. Whenever a request is served, the AGV gets a reinforcement which 
might be inversely proportional to the time delay with which the request is served. The problem then is to 
learn an optimal policy (a mapping from states of the world to AGV actions) that results in the maximum 
expected reward rate. 

Although less straightforward, it is also possible to frame the space shuttle scheduling problem in 
the reinforcement learning context. In this context, the scheduler takes abstract scheduling actions such 
as moving tasks from one time slot to another, and from one machine to another. It gets a negative 
reinforcement proportional to the cost of the final conflict-free schedule. If the scheduler is able to correctly 
predict the cost of the final conflict-free schedule from intermediate states, then the search complexity can be 
reduced by choosing the scheduling action that will minimize the final cost of the schedule. The problem 
thus reduces to learning to predict the final cost of the schedule from intermediate states with scheduling 
conflicts. 

Many reinforcement learning methods are based on on-line versions of dynamic programming. In 
dynamic programming, the cost or value of a state is computed by backing up the costs of the succeeding 
states. Instead of computing the value of a state once and for all from the values of all its neighbors, 
reinforcement learning updates the value of a state incrementally, as and when its neighbors’ values are 
updated. Both the on-line and off-line versions of dynamic programming store a table of state-value pairs. 
The on-line version uses this table to choose an action that minimizes the expected cost of the final state. 
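
The backing-up of costs described above can be illustrated with a minimal sketch. It assumes a small deterministic problem in which each action leads to a single known successor; the state names, the cost table, and the function `backup_values` are hypothetical and are not taken from any system discussed in this paper.

```python
def backup_values(successors, cost, terminal, gamma=1.0, sweeps=100):
    """Repeatedly back up costs: V(s) = min over actions of cost + gamma * V(next)."""
    values = {s: 0.0 for s in successors}
    for _ in range(sweeps):
        for s in successors:
            if s in terminal:
                continue  # terminal states keep a value of zero
            values[s] = min(cost[(s, a)] + gamma * values[nxt]
                            for a, nxt in successors[s].items())
    return values


# Hypothetical chain: A -> B -> goal, or A -> goal directly at a higher cost.
successors = {"A": {"step": "B", "jump": "goal"}, "B": {"step": "goal"}, "goal": {}}
cost = {("A", "step"): 1.0, ("A", "jump"): 5.0, ("B", "step"): 1.0}
values = backup_values(successors, cost, terminal={"goal"})
```

In this toy problem the value of A settles at the cost of the cheaper two-step route once the value of B has been backed up from the goal.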

One of the problems with table-based reinforcement learning techniques like Q-learning is their 
space complexity [WATKINS92]. In the worst case, they need tables at least as big as the entire state space, 
which is unrealistic in all but the most trivial domains. This also translates to very slow 
convergence of learning, because most states in the state space have to be updated many times before the 
learner’s actions converge to optimal performance. 

Reinforcement learning researchers use supervised learning methods to store the tables compactly and 
to converge quickly to the correct policy. For example, the table of state-value pairs can be used to train 
a neural net, which can then predict the values of states that have not been stored. Similarly, we can 
store the state-value pairs without generalizing them and use approaches like nearest 
neighbor algorithms to predict the values of unseen states by interpolating between the stored states 
nearest to the unseen state. Another approach would be to approximate the state-value table 
by a set of piecewise polynomial functions using methods like spline interpolation. Such “structural 
generalization” methods give rise to compact storage of states as well as quicker convergence when the 
function they are trying to approximate suits their structure. 
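
As an illustration of the nearest neighbor idea, the following sketch predicts the value of an unseen state by inverse-distance weighting of the k nearest stored states. The feature vectors, the stored values, the Euclidean distance, and the function name `knn_value` are all assumptions made for the example; other distance measures or interpolation schemes could be used equally well.

```python
import math


def knn_value(query, stored, k=2):
    """Estimate a state's value from the k nearest stored (features, value) pairs."""
    ranked = sorted(stored, key=lambda item: math.dist(query, item[0]))[:k]
    weighted = []
    for features, value in ranked:
        distance = math.dist(query, features)
        if distance == 0.0:
            return value  # the state is already stored; no interpolation needed
        weighted.append((1.0 / distance, value))
    total = sum(weight for weight, _ in weighted)
    return sum(weight * value for weight, value in weighted) / total


# Hypothetical stored state-value pairs over two numeric features.
stored = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0), ((10.0, 10.0), 100.0)]
estimate = knn_value((1.0, 0.0), stored)
```

The query state lies midway between the first two stored states, so its estimate is the average of their values, and the distant third state does not contribute.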

One of the problems in reinforcement learning is trading off exploration with optimal decision making. 
Exploration facilitates learning new knowledge while optimal decision making exploits old knowledge. 
However, the most widely studied exploration strategies are random. We plan to investigate more sophisticated 
strategies that explore “near miss” states. These strategies seem effective for learning piecewise polynomial 
functions with many locally irrelevant features. 

The rest of the paper is organized as follows. The next section introduces reinforcement learning. 
Section 3 introduces the NASA space shuttle payload processing domain and puts it in the framework of 
reinforcement learning. Section 4 does the same for the AGV scheduling domain. Section 5 discusses the 
structural generalization problem in reinforcement learning and proposes a number of solutions that we 
plan to implement. Section 6 discusses some of the other challenges that the scheduling domain offers to 
reinforcement learning, and some proposed solutions. The last section concludes with a summary. 
2 Reinforcement Learning 


Reinforcement learning is suited to a class of stochastic optimal control problems called Markovian 
Decision Problems. Such a problem is described by a 4-tuple <S, A, p, c>. S is a set of states 
and A is the set of actions. $p_{ij}(a)$ is the probability, for every state pair $(i, j)$ and action $a \in A$, that 
[...] this action. In a state $i_t$ at time $t$, [...] The cost 
or reward is discounted by multiplying it with a discount factor $\gamma^t$, where [...] 

A policy [...] is a mapping from the state $i_t$ at time $t$ to a recommended action in A. [...] denotes the 
expected value of the infinite-horizon discounted cost, given by: 

$$f(i_t) = E\left[\sum_{j=0}^{\infty} \gamma^{j} c_{t+j}\right]$$

[...] is optimal. Various versions of dynamic programming (DP) [...] costs from the last state 
to previous states in different orders [BARTO93]. Reinforcement learning methods use on-line versions of 
[...] Q-learning eliminates the need for separately learning the p matrix by 
[...] from state-action pairs to the [...] discounted costs of taking 
an action in that state. In particular, if $i_t$ is the current state and $a_t$ is the action taken, then 

$$Q(i_t, a_t) = \sum_{j=0}^{\infty} \gamma^{j} c_{t+j}$$

Since the value of a state V(i) is the value of the best action in that state, we can write 

$$V(i_t) = \min_{a} Q(i_t, a)$$

and, it follows that 

$$Q(i_t, a_t) = c_t + \gamma V(i_{t+1})$$

[...] the values of the final “absorbing states” can be immediately calculated, which are 
[...] $k$, and $\alpha$ is the learning rate, $Q_k$ is updated for states in $S_k$ from last to first as follows: 

$$Q_k(i_t, a_t) = (1 - \alpha)\, Q_{k-1}(i_t, a_t) + \alpha \left(c_t + \gamma V_{k-1}(i_{t+1})\right)$$

where $V_k(i_t) = \min_{a} Q_k(i_t, a)$. 

[...] 
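
The update rule above can be sketched in a few lines of code. This is a minimal illustration that assumes a table keyed by (state, action) pairs with entries initialized to zero and treats a state with no available actions as absorbing with value zero; the names `q_update`, `"s0"`, `"s_prev"` and `"move"` are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, cost, next_state, next_actions, alpha=0.5, gamma=0.9):
    """Apply Q(i,a) <- (1 - alpha) Q(i,a) + alpha (c + gamma V(i')), V(i') = min_a Q(i',a)."""
    # An absorbing next state has no actions, so its value defaults to zero.
    next_value = min((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (cost + gamma * next_value)
    return q[(state, action)]


q = defaultdict(float)
first = q_update(q, "s0", "move", cost=4.0, next_state="goal", next_actions=[])
second = q_update(q, "s0", "move", cost=4.0, next_state="goal", next_actions=[])
earlier = q_update(q, "s_prev", "move", cost=1.0, next_state="s0", next_actions=["move"])
```

Repeating the update for the same transition moves the stored estimate part of the way toward the observed cost each time, at a pace set by the learning rate, and the update for the earlier state then backs up the current estimate of its successor.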




3 The Space Shuttle Payload Processing 

The NASA Space Shuttle Payload Processing domain is an example of a Job Shop Scheduling problem 
[ZWEBEN92]. Each ‘mission’ consists of a set of payload/carrier pairs, and a launch date. Each carrier 
requires a distinct set of tasks to prepare and process the payloads for a mission. The tasks are constrained 
by precedence and resource relationships. The resources are grouped into resource pools. The goal is 
to schedule all the tasks needed to load the carriers onto the orbiter, avoiding the resource contention 
problems, satisfying all precedence constraints while minimizing the total expected length of the schedule. 
More details can be found in [ZHANG93]. 

The Space Shuttle Payload Processing problem can be viewed as a state space search problem, where 
states are partial schedules with possible conflicts, and operators move from state to state by moving tasks. 
The problem is to find a conflict-free schedule of minimum length. Unlike in some other domains, there is 
a lot of flexibility in defining the operators in the scheduling domain. Individual tasks can be moved by a 
constant amount, or by an amount that depends on the availability of resources and the schedule of other 
tasks, or groups of tasks can be rescheduled using a single operator. One could also consider a hierarchy 
of abstract to more primitive operators. We plan to experiment with all these different options. 
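
To make the state space view concrete, the sketch below represents a partial schedule as a mapping from tasks to (start, duration, resource) triples, counts resource conflicts, and implements the simplest operator, moving one task by a constant amount. The representation, the task and resource names, and the helper functions are assumptions for illustration only; precedence constraints and resource pools are deliberately left out.

```python
def count_conflicts(schedule):
    """Count pairs of tasks that overlap in time on the same resource."""
    tasks = list(schedule.values())
    conflicts = 0
    for i, (start_a, dur_a, res_a) in enumerate(tasks):
        for start_b, dur_b, res_b in tasks[i + 1:]:
            overlap = start_a < start_b + dur_b and start_b < start_a + dur_a
            if res_a == res_b and overlap:
                conflicts += 1
    return conflicts


def move_task(schedule, task, delta):
    """Operator: return a new schedule with one task shifted by delta time units."""
    start, duration, resource = schedule[task]
    moved = dict(schedule)
    moved[task] = (start + delta, duration, resource)
    return moved


def schedule_length(schedule):
    """Finish time of the latest task."""
    return max(start + duration for start, duration, _ in schedule.values())


# Hypothetical partial schedule with one conflict on the "crane" resource.
state = {"t1": (0, 3, "crane"), "t2": (2, 2, "crane"), "t3": (0, 4, "bay")}
repaired = move_task(state, "t2", 1)
```

In this example the operator removes the conflict but lengthens the schedule, which is the kind of trade-off an evaluation function would have to capture.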

The search control knowledge for scheduling is expressed as an evaluation function that estimates the 
discounted final cost of the schedule reachable from the current state. In reinforcement learning methods 
like TD(λ), this amounts to an estimate of f* [SUTTON88]. Q-learning estimates it from the state that 
results by applying a scheduling action to the current state. 

If the evaluation function is accurate, then it can be used to select the action that leads to the state 
with the least cost without search. When the evaluation function is only approximate, as is likely to be the 
case in complex domains like scheduling, it can be combined with lookahead search, as done in two-person 
games like chess, to exploit the benefits of both knowledge and search. 
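
One way to combine an approximate evaluation function with search is a depth-limited lookahead, sketched below. The toy state graph, the deliberately inaccurate estimates, and the function names are hypothetical; the sketch assumes a single scheduler minimizing cost, so there is no opponent as there would be in a game.

```python
def lookahead_cost(state, depth, successors, evaluate):
    """Smallest estimated final cost reachable from state within depth moves."""
    children = successors(state)
    if depth == 0 or not children:
        return evaluate(state)  # fall back on the evaluation function at the frontier
    return min(lookahead_cost(child, depth - 1, successors, evaluate)
               for child in children)


def choose_successor(state, depth, successors, evaluate):
    """Pick the successor whose depth-limited lookahead cost is smallest."""
    return min(successors(state),
               key=lambda child: lookahead_cost(child, depth - 1, successors, evaluate))


# Hypothetical search graph and cost estimates; the estimate for "a" is too optimistic.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
estimate = {"root": 5, "a": 1, "b": 2, "a1": 9, "b1": 3}
greedy = choose_successor("root", 1, tree.__getitem__, estimate.__getitem__)
deeper = choose_successor("root", 2, tree.__getitem__, estimate.__getitem__)
```

With one step of lookahead the optimistic estimate for "a" misleads the choice, while a second step exposes the expensive state behind it.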

4 The AGV Scheduling Problem 

Automatic Guided Vehicles, or AGVs, are increasingly being used in manufacturing plants to cut down the 
cost of human labor in transporting materials from one place to another. Optimal scheduling of AGVs is 
a non-trivial task. In general, there can be multiple AGVs, with some routing constraints, e.g., two AGVs 
cannot be on the same route fragment going in opposite directions. The AGVs might also have capacity 
constraints such as the total load and volume they can carry, and the total time they can work without 
recharging. 

The transportation requests are stochastic and hence cannot be planned for in advance. The AGV gets 
a reward whenever it successfully serves a request. The behavior of the AGV is random in the beginning, 
but as it accumulates knowledge of the request patterns and the transportation costs involved, we expect 
it to perform better in the sense of serving more requests in a given time. In the reinforcement learning 
context, this corresponds to maximizing the average reward per unit time rather than maximizing the 
discounted reward. We can also associate a non-uniform reward structure with the requests and give more 
reward for serving some requests than for others. 
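
The average reward per unit time can be illustrated with a small sketch that compares two service orders over a fixed time horizon. The request names, service times, rewards, and the function `reward_rate` are assumptions for the example; a real AGV would face stochastic arrivals rather than a fixed list of requests.

```python
def reward_rate(order, service_time, reward, horizon):
    """Average reward per unit time when requests are served in order until time runs out."""
    elapsed = 0.0
    collected = 0.0
    for request in order:
        if elapsed + service_time[request] > horizon:
            break  # the next request cannot be completed within the horizon
        elapsed += service_time[request]
        collected += reward[request]
    return collected / horizon


# Hypothetical requests with a non-uniform reward structure.
service_time = {"r1": 6.0, "r2": 2.0, "r3": 2.0}
reward = {"r1": 1.0, "r2": 1.0, "r3": 2.0}
slow_first = reward_rate(["r1", "r2", "r3"], service_time, reward, horizon=8.0)
quick_first = reward_rate(["r3", "r2", "r1"], service_time, reward, horizon=8.0)
```

Serving the short, highly rewarded requests first yields a higher reward rate in this example, which is the quantity the learner is meant to maximize.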

A learning AGV is very attractive in a manufacturing plant because the scheduling environment is 
constantly changing and it is hard to manually optimize the scheduling algorithm to each changed situation. 
A learning AGV would automatically adapt itself to changes in its environment, be they added machines, 
changes in the AGV routes, or changes in the request rates and patterns. 

Once again, we treat the AGV scheduling as a state space search, and treat the status of various 
requests and the AGV as a state. The best action to take at any time depends on the current state. The 
optimal policy can be learned using methods like Q-learning. 

5 Structural Generalization in Reinforcement Learning 

One of the major issues in reinforcement learning is that of structural generalization. This can also be 
6 Research Issues 

The scheduling domain offers some interesting challenges to reinforcement learning. 
function with many locally irrelevant features, then it may be possible to explore this search space more 
intelligently. For example, it may be possible to determine locally irrelevant features by generating “near 
miss” examples, which are examples that differ from a base example in exactly one feature by a small 
amount, but change the value of the function. The existence of a near miss example for a feature shows the 
local relevance of that feature. Generating near miss examples and determining the values of the evaluation 
function at these examples adds the ability of intelligent exploration to reinforcement learning. 
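
A minimal sketch of near miss generation follows. It perturbs one feature of a base example at a time by a small amount and records the perturbed examples that change the value of the function. The piecewise target function, the step size, and the name `near_misses` are hypothetical and serve only to show how local relevance could be detected.

```python
def near_misses(base, evaluate, step=1.0, tolerance=1e-9):
    """Map each locally relevant feature index to the near miss examples found for it."""
    base_value = evaluate(base)
    found = {}
    for index in range(len(base)):
        for delta in (-step, step):
            candidate = list(base)
            candidate[index] += delta  # change exactly one feature by a small amount
            if abs(evaluate(candidate) - base_value) > tolerance:
                found.setdefault(index, []).append(tuple(candidate))
    return found


def piecewise_value(x):
    """Hypothetical piecewise function: only one of x[0], x[1] matters in each region."""
    return 2.0 * x[0] if x[2] < 5.0 else 3.0 * x[1]


low_region = near_misses((1.0, 4.0, 0.0), piecewise_value)
high_region = near_misses((1.0, 4.0, 9.0), piecewise_value)
```

In the first region only the first feature produces near misses, and in the second region only the second feature does, so each region has features that are locally irrelevant.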

7 Conclusions 

In this paper, we suggested that reinforcement learning can be usefully employed in scheduling domains to 
learn search control knowledge as well as to learn to do optimal real-time scheduling. The work reported 
here is preliminary and much remains to be done. We plan to implement the ideas reported in this paper, 
test them and report the results in the near future. 